Vehicle Loan Default Prediction

Submitted by :

Group 45
Akshay Prasannan - S3818611
Amal Gopidasan Nair - S3820786
Joel Varghese Cherian - S3808033

Table of Contents

  1. Introduction
  2. Report Overview
    2.1 Objective
    2.2 Dataset
    2.3 Descriptive Feature
    2.4 Target Feature
    2.5 Overview of Methodology
  3. Data Preprocessing
    3.1 Importing Dataset
    3.2 Dropping Columns
    3.3 Missing Values
    3.4 Renaming columns and Changing Data type
    3.5 Data transformation
    3.6 Encoding Categorical Features
  4. Predictive Modeling
    4.1 Scaling
    4.2 Feature Selection
    4.3 Data Sampling and Test-Train Split
  5. Model Fitting and Tuning
    5.1 Preliminaries: Model evaluation strategy
    5.2 KNN
    5.2.1 Hyper Parameter Tuning
    5.2.2 Final Model
    5.2.3 Performance Check
    5.3 Decision Tree
    5.3.1 Hyper Parameter Tuning
    5.3.2 Final Model
    5.3.3 Performance Check
    5.4 Naive Bayes
    5.4.1 Hyper Parameter Tuning
    5.4.2 Final Model
    5.4.3 Performance Check
    5.5 Bagging
    5.5.1 Hyper Parameter Tuning
    5.5.2 Final Model
    5.5.3 Performance Check
    5.6 RandomForest
    5.6.1 Hyper Parameter Tuning
    5.6.2 Final Model
    5.6.3 Performance Check
  6. Model Comparison & Performance Evaluation
    6.1 AUC-ROC Curve of Predictive Model
    6.2 Accuracy Score
    6.3 Classification Report
    6.4 Cross Validation & T-Test
  7. Imbalance Problem
  8. Critique & Limitation
  9. Summary
    9.1 Project Summary
    9.2 Summary of Findings
  10. Conclusion
  11. Reference

Introduction

In today's world, most people rely on credit and loans provided by banks and other financial institutions. One major sector involving a large volume of loans is the automobile sector. Vehicle loan defaults are currently on the rise, causing substantial losses to financial institutions. As a result, loan underwriting has become more stringent, and rejections due to poor customer profiles have increased. Consequently, the demand among financial organizations for an effective and efficient credit risk scoring model has grown.

Report Overview

Objective

The objective of this project is to accurately model and predict the risk of a borrower defaulting on a vehicle loan on the first EMI (Equated Monthly Installment) due date. The project is divided into two phases; this report covers the Phase 2 objective.

Phase 1 Objective

Phase 2 Objective

Dataset

The original dataset is from the 'DataScience FinHack' competition presented by L&T Financial Services and Analytics Vidhya, and we obtained it from https://www.kaggle.com/lampubhutia/loandefault-ltfs-avml-finhack . The website provides two datasets, 'train.csv' and 'test.csv'. We downloaded train.csv and renamed it loandefault.csv. Only train.csv is used as the dataset in this project phase; for data splitting, we split loandefault.csv in Phase 2 as required. The dataset consists of 41 attributes in total, of which 40 are descriptive features and one is the target feature.

Descriptive Features

Variable Description Data_Type Unit
UniqueID Unique ID for identifying Customers. Nominal category Not Applicable
disbursed_amount Loan Amount disbursed Numerical Indian Currency Rupees
asset_cost The cost of asset Numerical Indian Currency Rupees
ltv Loan to Value of the asset Numerical Percentage
branch_id The branch where the loan was disbursed Nominal category Not Applicable
supplier_id Vehicle dealer ID where loan was disbursed Nominal category Not Applicable
manufacturer_id Vehicle manufacturer(Hero, Honda, TVS etc.) Nominal category Not Applicable
Current_pincode Current pincode of the customer Nominal category Not Applicable
Date.of.Birth Date of birth of the customer Numerical DD/MM/YYYY
Employment.Type Employment Type of the customer (Salaried/Self... Nominal category Not Applicable
DisbursalDate Date of disbursement Numerical DD/MM/YYYY
State_ID State of disbursement Nominal category Not Applicable
Employee_code_ID Employee of the organization who logged the di... Nominal category Not Applicable
MobileNo_Avl_Flag Flagged as 1 if the mobile number is shared by... Nominal category - Binary Not Applicable
Aadhar_flag Flagged as 1 if the Aadhaar number is shared by... Nominal category - Binary Not Applicable
PAN_flag Flagged as 1 if the PAN number is shared by th... Nominal category - Binary Not Applicable
VoterID_flag Flagged as 1 if the Voter ID is shared by the ... Nominal category - Binary Not Applicable
Driving_flag Flagged as 1 if the Driving license is shared ... Nominal category - Binary Not Applicable
Passport_flag Flagged as 1 if the Passport is shared by the ... Nominal category - Binary Not Applicable
PERFORM_CNS.SCORE Bureau Score Numerical Not Applicable
PERFORM_CNS.SCORE.DESCRIPTION Bureau score description Ordinal Category Not Applicable
PRI.NO.OF.ACCTS Total number of loan taken by the customer at ... Numerical Not Applicable
PRI.ACTIVE.ACCTS Total number of active loan at time of disburs... Numerical Not Applicable
PRI.OVERDUE.ACCTS Total number of default accounts at the time o... Numerical Not Applicable
PRI.CURRENT.BALANCE Total principal outstanding amount of the acti... Numerical Indian Currency Rupees
PRI.SANCTIONED.AMOUNT Total amount that was sanctioned for all the l... Numerical Indian Currency Rupees
PRI.DISBURSED.AMOUNT Total amount that was disbursed for all the lo... Numerical Indian Currency Rupees
SEC.NO.OF.ACCTS Total number of loan taken by the customer at ... Numerical Not Applicable
SEC.ACTIVE.ACCTS Total number of active loan at time of disburs... Numerical Not Applicable
SEC.OVERDUE.ACCTS Total number of default accounts at the time o... Numerical Not Applicable
SEC.CURRENT.BALANCE Total principal outstanding amount of the acti... Numerical Indian Currency Rupees
SEC.SANCTIONED.AMOUNT Total amount that was sanctioned for all the l... Numerical Indian Currency Rupees
SEC.DISBURSED.AMOUNT Total amount that was disbursed for all the lo... Numerical Indian Currency Rupees
PRIMARY.INSTAL.AMT EMI Amount of the primary loan Numerical Indian Currency Rupees
SEC.INSTAL.AMT EMI Amount of the secondary loan Numerical Indian Currency Rupees
NEW.ACCTS.IN.LAST.SIX.MONTHS New loans taken by the customer in last 6 mont... Numerical Not Applicable
DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS Loans defaulted in the last 6 months Numerical Not Applicable
AVERAGE.ACCT.AGE Average loan tenure Numerical MM/YY
CREDIT.HISTORY.LENGTH Time since first loan Numerical MM/YY
NO.OF_INQUIRIES Enquiries made by the customer for loans Numerical Not Applicable
loan_default Whether there was a payment default on the first EMI. 0:no... Nominal category - Binary Not Applicable

Target Feature

The target feature is:
loan_default : whether there was a payment default on the first EMI. The target feature has two levels, 0 (no default) and 1 (default).

Overview of Methodology

The overall approach towards building an accurate predictive model for this report is as follows:

Data Preprocessing

Importing Dataset

Dropping columns

Missing values

Renaming columns and Changing Data type

Data Transformation

Encoding Categorical Features

Predictive Modeling

We perform a set of preliminary activities before modeling.

Scaling

Machine learning algorithms work better when the feature values are on a similar scale. Hence, the feature values are scaled to the range 0 to 1, without changing the shape of the distribution, using the min-max scaler from the Scikit-Learn library.

Scaled Value $ = \frac{X - X_{min}}{X_{max} - X_{min}} $
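The report does not include the scaling code, so the step can be sketched as below; the feature matrix is synthetic and for illustration only.

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Hypothetical feature matrix; the two columns are on very different scales
X = np.array([[10.0, 100000.0],
              [20.0, 300000.0],
              [30.0, 500000.0]])

# MinMaxScaler rescales each column to [0, 1] by default,
# applying (x - min) / (max - min) column-wise
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```

Each column now runs from 0 to 1, while the relative spacing of the values (the shape of the distribution) is unchanged.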

Feature Selection

The dataset Data consists of 33 features. Using all the descriptive features might lead to overfitting, longer computation times, and poor performance. Feature selection is the practice of selecting the subset of features from the dataset that performs most efficiently and effectively with the selected machine learning models. Here, Random Forest feature importance is used to select the top 15 features for prediction.

From the above figure it can be observed that the most important features for modeling are:

The selected features are stored as a data frame, Data_best. This is further used for prediction.
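A minimal sketch of this selection step, using synthetic data in place of the loandefault dataset (the feature names and shapes here are illustrative assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 33-feature dataset
X, y = make_classification(n_samples=500, n_features=33,
                           n_informative=8, random_state=999)
Data = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(33)])

rf = RandomForestClassifier(n_estimators=100, random_state=999)
rf.fit(Data, y)

# Rank features by impurity-based importance and keep the top 15
importances = pd.Series(rf.feature_importances_, index=Data.columns)
top15 = importances.sort_values(ascending=False).head(15).index
Data_best = Data[top15]
print(Data_best.shape)
```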

Data Sampling and Test-Train Split

The original dataset consists of 225493 rows. For this report, a random sample of 100000 rows is considered. Additionally, the dataset is separated into the descriptive features (Data_sample) and the target (Target_sample).
Data_sample and Target_sample are then split into train and test data: 70% is used for training the models and 30% for testing.
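The sampling and splitting steps can be sketched as follows; the toy frame below (with an illustrative `ltv` column and a smaller sample size) stands in for the full 225,493-row dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative frame standing in for the full loandefault dataset
rng = np.random.default_rng(999)
df = pd.DataFrame({"ltv": rng.random(5000),
                   "loan_default": rng.integers(0, 2, 5000)})

# Draw a random sample (100,000 rows in the report; 1,000 here)
sample = df.sample(n=1000, random_state=999)
Data_sample = sample.drop(columns="loan_default")
Target_sample = sample["loan_default"]

# 70/30 train-test split
D_train, D_test, t_train, t_test = train_test_split(
    Data_sample, Target_sample, test_size=0.3, random_state=999)
print(len(D_train), len(D_test))  # 700 300
```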

Model Fitting & Tuning

Preliminaries: Model evaluation strategy

KNN

Hyper Parameter Tuning

Two arguments are set for modeling K-Nearest Neighbor, they are:

Hyperparameter tuning is performed with different values of n_neighbors and p to select the best parameters for the KNN model.

The values considered are as follows:

The above output suggests that the best model is derived with the following parameter:

From the above graph we can observe that as the K value increases, the mean CV score for Manhattan distance rises. The Euclidean distance score also rises, but stays below the Manhattan distance line.
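The tuning step can be sketched with a grid search over n_neighbors and p (the grid values and synthetic data here are illustrative assumptions; p=1 is Manhattan distance, p=2 is Euclidean):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=999)

# Search over K and the Minkowski power parameter p
param_grid = {"n_neighbors": [1, 5, 10], "p": [1, 2]}
gs = GridSearchCV(KNeighborsClassifier(), param_grid,
                  cv=5, scoring="roc_auc")
gs.fit(X, y)
print(gs.best_params_)
```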

Final Model

Performance Check

Decision Tree

Hyper Parameter Tuning

To model the decision tree three arguments are set:

Hyperparameter tuning is performed with different values of criterion, max_depth, and min_samples_split to select the best parameters for the decision tree.

The values considered are as follow:

The above output of hyperparameter tuning suggests that the best parameters for the decision tree are:

The figure above shows the mean CV score at various max depth values. The red line represents the Gini index criterion while the blue line represents the entropy criterion.

As can be seen in the figure, the Gini index line stays above the entropy line until the max depth value reaches 5, after which the entropy line continues to rise. The entropy line reaches its maximum mean CV score at a max depth value of 6.
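The decision tree tuning described above can be sketched like this; the grid values and synthetic data are illustrative, not the report's actual run.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=999)

# Search over split criterion, tree depth, and minimum split size
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, 6],
    "min_samples_split": [2, 5, 10],
}
gs = GridSearchCV(DecisionTreeClassifier(random_state=999), param_grid,
                  cv=5, scoring="roc_auc")
gs.fit(X, y)
print(gs.best_params_)
```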

Final Model

Final model is developed with the parameters derived from the Hyperparameter tuning

Performance Check

Naive Bayes

A Naive Bayes classifier is a probabilistic machine learning model used to perform classification tasks. Bayes' theorem is at the heart of the classifier.

For each class, Naive Bayes predicts membership probabilities, i.e. the likelihood that a given record or data point belongs to that class. The most probable class is the one with the highest probability.
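The class-probability behaviour described above can be seen directly with scikit-learn's Gaussian Naive Bayes (synthetic data; a sketch, not the report's run):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=999)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=999)

nb = GaussianNB()
nb.fit(X_tr, y_tr)

# predict_proba returns per-class membership probabilities;
# predict simply picks the most probable class
proba = nb.predict_proba(X_te)
pred = nb.predict(X_te)
print(proba[0], pred[0])
```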

Hyper Parameter Tuning

Final Model

Performance Check

Bagging

In bagging (bootstrap aggregating), each model in the ensemble is trained on a random sample of the dataset known as a bootstrap sample. Bagging is an ensemble meta-algorithm for increasing the accuracy and stability of machine learning algorithms used in statistical regression and classification.
Bagging and boosting are the two primary types of ensemble machine learning. Both regression and statistical classification can benefit from the bagging strategy. Bagging is used with decision trees to increase model stability by reducing variance and enhancing accuracy, which avoids the problem of overfitting.
We determine the optimal number of estimators.
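The search for the number of estimators can be sketched as below; scikit-learn's BaggingClassifier bags decision trees by default, and the grid values and data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=999)

# BaggingClassifier defaults to bagged decision trees;
# search over the number of bootstrap-trained estimators
gs = GridSearchCV(BaggingClassifier(random_state=999),
                  {"n_estimators": [10, 30, 50]},
                  cv=5, scoring="roc_auc")
gs.fit(X, y)
print(gs.best_params_["n_estimators"])
```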

Hyper Parameter Tuning

Final Model

Performance Check

RandomForest

Hyper Parameter Tuning

To model the Random Forest, two arguments are set:

Hyperparameter tuning is performed with different values of criterion and n_estimators to select the best parameters for the Random Forest.

The values considered are as follow:

The above output after tuning suggests that the optimum parameters for the best model are:

with a best score of 0.5945758783380247

The figure above shows the mean CV score at various n_estimators values. The red line represents the Gini index criterion while the blue line represents the entropy criterion.

As can be seen in the figure, the entropy line rises above the Gini index line as n_estimators increases; the Gini index line also rises, but remains below the entropy line with a lower mean CV score.
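The Random Forest tuning over criterion and n_estimators can be sketched as follows (grid values and data are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=999)

# Search over split criterion and forest size
param_grid = {"criterion": ["gini", "entropy"],
              "n_estimators": [50, 100, 200]}
gs = GridSearchCV(RandomForestClassifier(random_state=999), param_grid,
                  cv=5, scoring="roc_auc")
gs.fit(X, y)
print(gs.best_params_)
```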

Final Model

Performance Check

Model Comparison and Performance Evaluation

AUC- ROC Curve of Predictive Models

The above figure displays the AUC-ROC curve and score for each selected predictive model. The ROC plot suggests that the ROC curve of each model is only just above the random prediction line; among all the selected models, the Decision Tree performs best with an AUC score of 0.607.
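The AUC computation behind such a plot can be sketched as below; note that the AUC is computed from predicted probabilities, not hard class labels (the model, data, and depth here are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=999)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=999)

dt = DecisionTreeClassifier(max_depth=5, random_state=999).fit(X_tr, y_tr)

# Probability of the positive class drives the ROC curve and AUC
scores = dt.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)
print(round(auc, 3))
```

Plotting fpr against tpr for each model on one axis gives the comparison figure described above.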

Accuracy Score

The above output suggests that the decision tree scored the best of all the models with about 78.38% accuracy, followed closely by the KNN model at 78.34% and the Random Forest model at 77.44%.

Classification Report

We use the following metrics to measure the performance of the models with the Test dataset:

All models have an F1 score greater than 0.85, which is a good sign. As we check for a balance between precision and recall, this F1 score looks good.
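A classification report of this kind can be produced per model as sketched below (model and data are illustrative; `output_dict=True` makes the per-class metrics programmatically accessible):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=999)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=999)

dt = DecisionTreeClassifier(max_depth=5, random_state=999).fit(X_tr, y_tr)

# Per-class precision, recall, and F1 on the test set
report = classification_report(y_te, dt.predict(X_te), output_dict=True)
print(report["1"]["f1-score"])
```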

Model Comparison

Here we compare the models on the basis of loan default prediction (prediction of '1'). Specifically, the comparison is performed on the basis of precision, recall, F1-score, overall accuracy of the model, and overall ROC-AUC score.

The plot shows the readings for target value 1, i.e. where the customer is a defaulter.

Cross Validation and T-Test

A p-value less than 0.05 indicates that a difference is statistically significant. Here, from the results, only the KNN vs. DT comparison has a p-value greater than 0.05, indicating that at the 95% confidence level the difference between their scores, though small, is not statistically significant and their performance is comparable. All other pairs of models have p-values less than 0.05, so those differences are statistically significant.
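A paired comparison of this kind can be sketched with cross-validated scores and a paired t-test (synthetic data; by default cross_val_score uses the same stratified folds for both models, so the scores are paired):

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=999)

# Identical fold assignments make the two score arrays paired
knn_scores = cross_val_score(KNeighborsClassifier(), X, y,
                             cv=10, scoring="roc_auc")
dt_scores = cross_val_score(DecisionTreeClassifier(random_state=999), X, y,
                            cv=10, scoring="roc_auc")

# Paired t-test on the fold-by-fold score differences
t_stat, p_value = stats.ttest_rel(knn_scores, dt_scores)
print(round(p_value, 4))
```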

Imbalance Problem

We check the target variable and find that it is imbalanced, with 78% of the observations belonging to the negative class.
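The check itself is a one-liner on the target column; the series below is an illustrative stand-in with the same 78/22 split as the report's data.

```python
import pandas as pd

# Illustrative target column with the report's ~78% negative class
target = pd.Series([0] * 78 + [1] * 22, name="loan_default")

# Relative frequency of each class reveals the imbalance
counts = target.value_counts(normalize=True)
print(counts)
```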

Critique & Limitations

The aim of this project was to build and test five supervised machine learning models to predict a loan default made by a borrower at the first EMI.
The considered models were K-Nearest Neighbor, Naive Bayes, Decision Tree, Bagging, and Random Forest.

Summary

Project Summary

Phase 1

In Phase 1 of the project we carried out data preprocessing and visualization. The dataset was made ready for further modeling and prediction in Phase 2. The unnecessary attributes were removed from the dataset, and the necessary preprocessing and transformation were performed on the required attributes. The relationships between features were explored through visualization.

Phase 2

Phase 2 of this project focused on building supervised machine learning models for predicting a loan default made by a borrower. The following classification models were designed, tuned, and tested:

Performance comparisons by ROC curve and F1 score were performed to assess the best model.

Summary of Findings

From all the tests we can conclude that the Decision Tree model with 15 features has the highest ROC-AUC on the test dataset; on the train dataset, the Decision Tree model again has the highest ROC-AUC score and outperforms all other models. We also observe that the F1 scores of all the models are similar for target value 0 (non-defaulters), while for defaulters (target value 1) the Naive Bayes model has the highest F1 score. Further model deployment can be done using the Decision Tree model. Only 15 features are considered here; other feature selection methods could be applied to draw better conclusions.

Conclusion

From the modeling and fitting we can conclude that the Decision Tree is the best model for this dataset compared with the other models used in the project. We performed feature selection and used the 15 selected features to model and fit the dataset. The Decision Tree algorithm can be used for further deployment, which is not in the scope of this project and can be done in the future. Hence we conclude that, using the Decision Tree algorithm, we can predict whether a borrower will default on a vehicle loan on the first EMI (Equated Monthly Installment) due date.

Reference

  1. LoanDefault_LTFS_AV (ML_FINHACK). (2021). Kaggle. Retrieved 10 April 2021, from https://www.kaggle.com/lampubhutia/loandefault-ltfs-avml-finhack
  2. Aksakalli, D. (n.d.). Case Study: Predicting Income Status. www.featureranking.com. Retrieved 29 May 2021, from https://www.featureranking.com/tutorials/machine-learning-tutorials/case-study-predicting-income-status/#7
  3. Feature Selection and Ranking in Machine Learning. (2021). www.featureranking.com. Retrieved 29 May 2021, from https://www.featureranking.com/
  4. dataprofessor/code. (2021). GitHub. Retrieved 2 June 2021, from https://github.com/dataprofessor/code/blob/master/python/ROC_curve.ipynb